Lab1 tutorial

Yue You

2021-06-17

Applied Data Science (MAST30034)

Welcome

Welcome to Applied Data Science for 2021 Semester 2!

This is a capstone project subject, so expectations are higher than in most other subjects you will take in your undergraduate course. Students are expected to have already completed the following subjects to a satisfactory level:

Elements of Data Processing (COMP20008)
Statistics (MAST20005)
Machine Learning (COMP30027)
Linear Statistical Models (MAST30025)

If you are unfamiliar with GitHub, it is in your best interest to revise how to use it, or to attend a consultation / revision workshop to learn.

Teaching Team

Your teaching staff will be as follows:

Lecturer: Dr. Karim Seghouane (Assignment 1)
Subject Coordinator: Akira Wang (Project 1 and 2)
Tutor: Yue You (TBA)

Tutorial Structure

Tutorials are broken into Python and R streams to support students in whichever language they prefer. The first hour of the tutorial will be based on general programming how-tos and walkthroughs. The remainder will generally follow a consultation / free-for-all style: we can cover a requested topic from the Advanced Tutorials module, answer project-related questions, or discuss industry and applying for jobs. You are free to attend any tutorial time, for either half or the full 2 hours, depending on your interests. You are all experienced university veterans, so do what works for you. Finally, tutorial attendance is not marked for the duration of Project 1 and Assignment 1, but you are expected to attend tutorials with your group for Project 2.

Lab 1 Overview

First Half

Using the R server:

Using GitHub Desktop vs Git CLI (Command Line Interface):

Create a repository for your Project 1, push a commit, and ensure your repository accepts the changes.

Project 1 Tips: How to get started and what to look out for.
Getting started on LaTeX with Overleaf.

Second Half

Revision:

Variable names, magic numbers, and constants.
Docstrings and comments.
Plotting geospatial maps the correct way.
Jupyter Notebook Magic Cells.
Data Serialisation.
Downloading files using urllib.

Advanced (Optional):

(Windows 10 Users) Installing WSL2 (Ubuntu 20.04) for a clean environment.
Introduction to Apache Spark 3.0

General Tips for R Markdown

Using git on the VM

https://mast30034.science.unimelb.edu.au/

Cloning:

Open a terminal (this only works with command-line git).
git clone HTTPS (where HTTPS is the https URL of your GitLab repo).
Enter your credentials.
Done.

Pushing:

Change directories to inside your repository (cd NAME_OF_REPO_FOLDER).
git add . (this adds all files in the current directory to a commit; you can specify individual files instead if you prefer).
git commit -m "message" (make a commit with a message).
git push
Enter your credentials.
Done.

Readable Code

We will be assessing the quality of your code and how you present it in your notebooks. This is because there is no point writing code that cannot be easily interpreted. At the end of the day, clients are paying for your analysis, but also the corresponding code. If your code is confusing or difficult to read, there is little chance your client will come back to you.

Variable Names

Any naming convention is fine as long as you use it consistently. For example, commit to either:

Snake Case: words are separated by underscores, such as variable_name
Camel Case: words are separated by capitals, such as variableName

Your variables should be contextual and describe the code. That is, try to name your variables so they are understandable without comments.
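
As a minimal sketch of consistent, descriptive naming in R, the block below uses snake case throughout and gives a name to a constant instead of leaving a bare number in the code. The data frame, its column names, and the threshold are made up for illustration; the miles unit for trip distance is an assumption here, not something stated in this tutorial.

```r
# Named constants instead of bare numbers scattered through the code
METRES_PER_MILE <- 1609.344      # exact conversion factor
MIN_PASSENGER_COUNT <- 1         # illustrative threshold only

# Toy data (made up); every name uses snake case
trips <- data.frame(
  trip_distance_miles = c(0.96, 2.69, 2.62),
  passenger_count = c(5, 2, 0)
)

# Descriptive names state the intent without needing extra comments
trips$trip_distance_metres <- trips$trip_distance_miles * METRES_PER_MILE
valid_trips <- trips[trips$passenger_count >= MIN_PASSENGER_COUNT, ]

print(valid_trips)
```

Compare this with names like x, df2, or tmp: the reader would have to trace the code to learn what each one holds.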

Comments and Docstrings (w.r.t. Jupyter Notebook Cells)

Cells in Jupyter Notebook should aim to do one “block of logic” at a time (e.g. importing libraries, defining functions, filtering rows, etc).

If it takes a reader more than a few seconds to understand your cell, you need comments. Your functions need to have docstrings describing what they do.
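
R has no built-in docstring syntax, so one common convention is roxygen2-style comments (lines starting with #') placed directly above the function. The sketch below assumes that convention; the function name and its purpose are hypothetical and chosen only for illustration.

```r
#' Convert trip distances from miles to kilometres.
#'
#' @param distance_miles Numeric vector of distances in miles.
#' @return Numeric vector of the same length, in kilometres.
miles_to_km <- function(distance_miles) {
  KM_PER_MILE <- 1.609344
  # Fail early on non-numeric input rather than returning nonsense
  stopifnot(is.numeric(distance_miles))
  distance_miles * KM_PER_MILE
}

miles_to_km(c(1, 2.5))
```

The header comment says what the function does, what it expects, and what it returns; the single inline comment explains a decision that the code alone does not make obvious.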

Let’s start!

Install the tmap package (a library for thematic maps) and the other required R packages

#install.packages("dplyr")
#install.packages("sf")
#install.packages("curl")
#Restart your R Session
#install.packages("tmap")

Install ggmap

#install.packages("ggmap")

#OR (choose whichever works on your computer)

#install.packages("devtools")
#devtools::install_github("dkahle/ggmap")

Load libraries

library(dplyr)
library(sf)
library(curl)
library(ggmap)
library(tmap)
library(tmaptools)

Read in the data

df = read.csv("/Volumes/you.y/MAST30034_R/data/sample.csv",stringsAsFactors = TRUE)
head(df)
##   VendorID tpep_pickup_datetime tpep_dropoff_datetime passenger_count
## 1        2         1/12/15 0:00          1/12/15 0:05               5
## 2        2         1/12/15 0:00          1/12/15 0:00               2
## 3        2         1/12/15 0:00          1/12/15 0:00               1
## 4        1         1/12/15 0:00          1/12/15 0:05               1
## 5        1         1/12/15 0:00          1/12/15 0:09               2
## 6        1         1/12/15 0:00          1/12/15 0:16               1
##   trip_distance pickup_longitude pickup_latitude RatecodeID store_and_fwd_flag
## 1          0.96        -73.97994        40.76538          1                  N
## 2          2.69        -73.97234        40.76238          1                  N
## 3          2.62        -73.96885        40.76453          1                  N
## 4          1.20        -73.99393        40.74168          1                  N
## 5          3.00        -73.98892        40.72699          1                  N
## 6          6.30        -73.97408        40.76291          1                  N
##   dropoff_longitude dropoff_latitude payment_type fare_amount extra mta_tax
## 1         -73.96631         40.76309            1         5.5   0.5     0.5
## 2         -73.99363         40.74600            1        21.5   0.0     0.5
## 3         -73.97455         40.79164            1        17.0   0.0     0.5
## 4         -73.99767         40.74747            1         6.5   0.5     0.5
## 5         -73.97559         40.69687            2        11.0   0.5     0.5
## 6         -74.01280         40.70221            1        20.5   0.5     0.5
##   tip_amount tolls_amount improvement_surcharge total_amount
## 1       1.00            0                   0.3         7.80
## 2       3.34            0                   0.3        25.64
## 3       3.56            0                   0.3        21.36
## 4       0.20            0                   0.3         8.00
## 5       0.00            0                   0.3        12.30
## 6       4.35            0                   0.3        26.15

Check the dimensions of the dataset.

dim(df)
## [1] 100000     19
colnames(df)
##  [1] "VendorID"              "tpep_pickup_datetime"  "tpep_dropoff_datetime"
##  [4] "passenger_count"       "trip_distance"         "pickup_longitude"     
##  [7] "pickup_latitude"       "RatecodeID"            "store_and_fwd_flag"   
## [10] "dropoff_longitude"     "dropoff_latitude"      "payment_type"         
## [13] "fare_amount"           "extra"                 "mta_tax"              
## [16] "tip_amount"            "tolls_amount"          "improvement_surcharge"
## [19] "total_amount"
typeof(df)
## [1] "list"
str(df)
## 'data.frame':    100000 obs. of  19 variables:
##  $ VendorID             : int  2 2 2 1 1 1 2 2 2 2 ...
##  $ tpep_pickup_datetime : Factor w/ 237 levels "1/12/15 0:00",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ tpep_dropoff_datetime: Factor w/ 536 levels "1/12/15 0:00",..: 5 1 1 5 9 16 2 8 17 10 ...
##  $ passenger_count      : int  5 2 1 1 2 1 6 2 1 2 ...
##  $ trip_distance        : num  0.96 2.69 2.62 1.2 3 6.3 0.63 1.91 4.5 1.42 ...
##  $ pickup_longitude     : num  -74 -74 -74 -74 -74 ...
##  $ pickup_latitude      : num  40.8 40.8 40.8 40.7 40.7 ...
##  $ RatecodeID           : int  1 1 1 1 1 1 1 1 1 1 ...
##  $ store_and_fwd_flag   : Factor w/ 2 levels "N","Y": 1 1 1 1 1 1 1 1 1 1 ...
##  $ dropoff_longitude    : num  -74 -74 -74 -74 -74 ...
##  $ dropoff_latitude     : num  40.8 40.7 40.8 40.7 40.7 ...
##  $ payment_type         : int  1 1 1 1 2 1 1 1 1 1 ...
##  $ fare_amount          : num  5.5 21.5 17 6.5 11 20.5 4 8 16.5 8.5 ...
##  $ extra                : num  0.5 0 0 0.5 0.5 0.5 0.5 0.5 0.5 0.5 ...
##  $ mta_tax              : num  0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5 ...
##  $ tip_amount           : num  1 3.34 3.56 0.2 0 4.35 1.06 1.86 3.56 2.45 ...
##  $ tolls_amount         : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ improvement_surcharge: num  0.3 0.3 0.3 0.3 0.3 0.3 0.3 0.3 0.3 0.3 ...
##  $ total_amount         : num  7.8 25.6 21.4 8 12.3 ...
summary(df[,c('pickup_latitude', 'pickup_longitude')])
##  pickup_latitude pickup_longitude
##  Min.   : 0.00   Min.   :-77.05  
##  1st Qu.:40.73   1st Qu.:-73.99  
##  Median :40.75   Median :-73.98  
##  Mean   :40.14   Mean   :-72.88  
##  3rd Qu.:40.77   3rd Qu.:-73.97  
##  Max.   :42.74   Max.   :  0.00
which(is.na(df['pickup_longitude']))
## integer(0)
summary(df[,c('pickup_latitude', 'pickup_longitude')])[c(2,5),]
##  pickup_latitude pickup_longitude
##  1st Qu.:40.73   1st Qu.:-73.99  
##  3rd Qu.:40.77   3rd Qu.:-73.97
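
The summary above reports a minimum latitude of 0 and a maximum longitude of 0, while is.na finds no missing values. This suggests that some records use zero in place of a real coordinate. The following is a minimal sketch of counting and dropping such rows, using a small made-up data frame (toy_trips) instead of the real df so that it runs on its own.

```r
# Toy stand-in for df; the values are made up for illustration
toy_trips <- data.frame(
  pickup_longitude = c(-73.98, 0, -73.97, -73.99),
  pickup_latitude  = c(40.77, 0, 40.76, 0)
)

# A row is suspect if either coordinate is exactly zero
has_zero_coord <- toy_trips$pickup_longitude == 0 |
  toy_trips$pickup_latitude == 0

# How many rows are affected, and what remains after dropping them
n_zero_rows <- sum(has_zero_coord)
clean_trips <- toy_trips[!has_zero_coord, ]

n_zero_rows
nrow(clean_trips)
```

On the real data, the same logical vector could be built from df$pickup_longitude and df$pickup_latitude. Whether to drop or flag these rows is a judgment call for your own analysis.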

Filtering and variable selection

head(df %>% 
  filter(VendorID == 1) %>% 
  select(trip_distance, pickup_longitude,pickup_latitude))
##   trip_distance pickup_longitude pickup_latitude
## 1           1.2        -73.99393        40.74168
## 2           3.0        -73.98892        40.72699
## 3           6.3        -73.97408        40.76291
## 4           0.0        -73.99016        40.75620
## 5           1.0        -73.99577        40.74379
## 6           1.8        -73.98841        40.76442
head(df %>% 
  filter(VendorID == 1 & passenger_count > 0) %>% 
  select(trip_distance, pickup_longitude,pickup_latitude))
##   trip_distance pickup_longitude pickup_latitude
## 1           1.2        -73.99393        40.74168
## 2           3.0        -73.98892        40.72699
## 3           6.3        -73.97408        40.76291
## 4           0.0        -73.99016        40.75620
## 5           1.0        -73.99577        40.74379
## 6           1.8        -73.98841        40.76442
df %>% 
  filter(VendorID == 2 & passenger_count > 100) %>% 
  select(trip_distance, pickup_longitude,pickup_latitude)
## [1] trip_distance    pickup_longitude pickup_latitude 
## <0 rows> (or 0-length row.names)

Download and view map

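# Look up Manhattan's bounding box via OpenStreetMap (tmaptools),
# then download the map tiles for that box (ggmap)
manhattan_bbox <- as.numeric(geocode_OSM("Manhattan")$bbox)
map <- get_stamenmap(manhattan_bbox, zoom = 11)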
ggmap(map)

For Project 1, we will be working with NYC data. Here’s one way to find the range of the pickup coordinates, ignoring zero values.

# Range of the pickup coordinates, excluding rows where the value is 0
xranges <- df %>% filter(pickup_longitude != 0) %>% pull(pickup_longitude) %>% range()
yranges <- df %>% filter(pickup_latitude != 0) %>% pull(pickup_latitude) %>% range()
xranges
## [1] -77.04710 -71.06483
yranges
## [1] 37.27044 42.73614
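
Even with zeros excluded, these ranges span several degrees, so some pickups lie well outside the map area. One possible next step is to keep only the points inside a bounding box before plotting. The sketch below assumes a hypothetical helper (is_within_bbox) and a rough, illustrative box (ROUGH_BBOX); neither comes from this tutorial, and the toy coordinates are made up.

```r
# Hypothetical bounding box, named in (left, bottom, right, top) style.
# Rough illustrative values only, not an official NYC boundary.
ROUGH_BBOX <- c(left = -74.3, bottom = 40.4, right = -73.6, top = 41.0)

#' Test whether points fall inside a bounding box.
#'
#' @param longitude Numeric vector of longitudes.
#' @param latitude Numeric vector of latitudes.
#' @param bbox Named numeric vector with left, bottom, right, top.
#' @return Logical vector, TRUE where the point is inside the box.
is_within_bbox <- function(longitude, latitude, bbox) {
  longitude >= bbox[["left"]] & longitude <= bbox[["right"]] &
    latitude >= bbox[["bottom"]] & latitude <= bbox[["top"]]
}

# Toy pickups (made up): two near Manhattan, two far away
toy_pickups <- data.frame(
  pickup_longitude = c(-73.98, -77.05, -71.06, -73.97),
  pickup_latitude  = c(40.77, 38.90, 42.36, 40.76)
)

inside <- is_within_bbox(toy_pickups$pickup_longitude,
                         toy_pickups$pickup_latitude,
                         ROUGH_BBOX)
nyc_pickups <- toy_pickups[inside, ]
nrow(nyc_pickups)
```

In practice, the box could come from the same bounding box used to download the map, so that the filtered points and the map tiles cover the same area.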

Plot pickup locations

ggmap(map) + 
  geom_point(data = df,
             aes(x = pickup_longitude, y = pickup_latitude),
             colour="white", size = 0.01,alpha = .5)